Abstract

Structural kernels are a flexible learning paradigm that has been widely used
in Natural Language Processing. However, the problem of model selection in
kernel-based methods is usually overlooked. Previous approaches mostly rely on
setting default values for kernel hyperparameters or using grid search, which
is slow and coarse-grained. In contrast, Bayesian methods allow efficient model
selection by maximizing the evidence on the training data through
gradient-based methods. In this paper we show how to perform this in the
context of structural kernels by using Gaussian Processes. Experimental results
on tree kernels show that this procedure results in better prediction
performance compared to hyperparameter optimization via grid search. The
framework proposed in this paper can be adapted to other structures besides
trees, e.g., strings and graphs, thereby extending the utility of kernel-based
methods.


1 Introduction 

Kernel-based methods are a staple machine learning approach in Natural Language
Processing (NLP). Frequentist kernel methods like the Support Vector Machine
(SVM) pushed the state of the art in many NLP tasks, especially classification
and regression. One interesting aspect of kernels is their ability to be
defined directly on structured objects like strings, trees and graphs. This
approach has the potential to move the modelling effort from feature
engineering to kernel engineering. This is useful when we do not have much
prior knowledge about how the data behaves, as we can more readily define a
similarity metric between inputs instead of trying to characterize which
features are the best for the task at hand.


Kernels are a very flexible framework: they can be combined and parameterized
in many different ways. Complex kernels, however, lead to the problem of model
selection, where the aim is to obtain the best kernel configuration in terms of
hyperparameter values. The usual approach for model selection in frequentist
methods is to employ grid search on some development data disjoint from the
training data. This approach can rapidly become impractical when using complex
kernels which increase the number of model hyperparameters. Grid search also
requires the user to explicitly set the grid values, making it difficult to
fine-tune the hyperparameters. Recent advances in model selection tackle some
of these issues, but have several limitations (see § for details).


Our proposed approach for model selection relies on Gaussian Processes (GPs)
(Rasmussen and Williams, 2006), a widely used Bayesian kernel machine. GPs
allow efficient and fine-grained model selection by maximizing the evidence on
the training data using gradient-based methods, dropping the requirement for
development data. As a Bayesian procedure, GPs also naturally balance between
model capacity and generalization. GPs have been shown to achieve state of the
art performance in various regression tasks (Hensman et al., 2013; Cohn and
Specia, 2013). Therefore, we base our approach on this framework.


While prediction performance is important to consider (as we show in our
experiments), we are mainly interested in two other significant aspects that
are enabled by our approach:

• Gradient-based methods are more efficient than grid search for high
  dimensional spaces. This allows us to easily propose new rich kernel
  extensions that rely on a large number of hyperparameters, which in turn can
  result in better modelling capacity.

• Since the model selection process is now fine-grained, we can interpret the
  resulting hyperparameter values, depending on how the kernel is defined.


In this work we focus on tree kernels, which have been successfully used in a
number of NLP tasks (see §). In most cases, these kernels are used as an SVM
component and model selection is not considered an important issue.
Hyperparameters are usually set to default values, which work reasonably well
in terms of prediction performance. However, this is only possible due to the
small number of hyperparameters these kernels contain.

We perform experiments comprising synthetic data (§4) and two real NLP
regression tasks: Emotion Analysis (§5.1) and Translation Quality Estimation
(§5.2). Our findings show that our approach outperforms SVMs using the same
kernels.


2 Gaussian Process Regression 

Our definition of GPs closely follows that of Rasmussen and Williams (2006).
Consider a setting where we have a dataset
$\mathcal{X} = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n)\}$,
where $\mathbf{x}_i$ is a $d$-dimensional input and $y_i$ the corresponding
output. Our goal is to infer an underlying function
$f : \mathbb{R}^d \rightarrow \mathbb{R}$ to explain this data, i.e.
$f(\mathbf{x}_i) \approx y_i$.

Formally, $f$ is drawn from a GP prior,

$$f(\mathbf{x}) \sim \mathcal{GP}(\mu(\mathbf{x}), k(\mathbf{x}, \mathbf{x}')),$$

where $\mu(\mathbf{x})$ is the mean function, which is usually the 0 constant,
and $k(\mathbf{x}, \mathbf{x}')$ is the kernel function.

In a regression setting, we assume that the response variables are noisy latent
function evaluations, i.e., $y_i = f(\mathbf{x}_i) + \eta$, where
$\eta \sim \mathcal{N}(0, \sigma_n^2)$ is added white noise. We assume a
Gaussian likelihood, which allows us to obtain a closed formula solution for
the posterior, namely

$$y_* \sim \mathcal{N}\left(\mathbf{k}_*^T (K + \sigma_n^2 I)^{-1} \mathbf{y},\;
k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}_*^T (K + \sigma_n^2 I)^{-1} \mathbf{k}_*\right),$$

where $\mathbf{x}_*$ and $y_*$ are respectively the test input and its response
variable, $K$ is the Gram matrix corresponding to the training inputs and
$\mathbf{k}_* = [k(\mathbf{x}_1, \mathbf{x}_*), k(\mathbf{x}_2, \mathbf{x}_*), \ldots, k(\mathbf{x}_n, \mathbf{x}_*)]$
is the vector of kernel evaluations between the test input and each training
input.
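The predictive distribution above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used in the paper: the function name `gp_predict` and its argument names are assumptions, and the Gram matrix is taken as already computed by whatever kernel is in use.

```python
import numpy as np

def gp_predict(K, k_star, k_ss, y, noise_var):
    """Predictive mean and variance for a single test input.

    K         : (n, n) Gram matrix over the training inputs
    k_star    : (n,) kernel values between the test input and each training input
    k_ss      : scalar k(x*, x*)
    y         : (n,) training responses
    noise_var : the noise variance sigma_n^2
    """
    G = K + noise_var * np.eye(len(y))
    L = np.linalg.cholesky(G)                      # G = L L^T
    a = np.linalg.solve(L.T, np.linalg.solve(L, y))  # a = G^{-1} y
    v = np.linalg.solve(L, k_star)                 # v^T v = k*^T G^{-1} k*
    mean = k_star @ a
    var = k_ss - v @ v
    return mean, var
```

A Cholesky factorization is used here instead of an explicit inverse only for numerical stability; the result is the same quantity as in the formula.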

A key property of GP models is their ability to perform efficient model
selection. This is achieved by employing gradient-based methods to maximize the
marginal likelihood,

$$p(\mathbf{y}|X, \boldsymbol{\theta}) = \int p(\mathbf{y}|X, \boldsymbol{\theta}, f)\, p(f)\, df,$$

where $\boldsymbol{\theta}$ represents the vector of model hyperparameters and
$\mathbf{y}$ is the vector of response variables from the training data. For a
Gaussian likelihood, we can take the log of the expression above to obtain in
closed form¹

$$\log p(\mathbf{y}|X, \boldsymbol{\theta}) =
\underbrace{-\frac{1}{2}\mathbf{y}^T G^{-1} \mathbf{y}}_{\text{data fit}}
\underbrace{-\frac{1}{2}\log|G|}_{\text{complexity penalty}}
\underbrace{-\frac{n}{2}\log 2\pi}_{\text{constant}},$$

where $G = K + \sigma_n^2 I$. The data fit term is dependent on the training
response variables, while the complexity penalty term relies only on the kernel
and training inputs. Since the first two terms have conflicting objectives,
optimizing the log marginal likelihood will naturally achieve a compromise and
thus limit overfitting (without the need for any validation step or additional
data).
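The three terms of the log marginal likelihood can be computed separately, which makes the trade-off between data fit and complexity easy to inspect. The sketch below assumes a precomputed Gram matrix; the name `log_marginal_likelihood` is illustrative.

```python
import numpy as np

def log_marginal_likelihood(K, y, noise_var):
    """log p(y | X, theta) for a zero-mean GP with Gaussian noise."""
    n = len(y)
    G = K + noise_var * np.eye(n)
    L = np.linalg.cholesky(G)                        # G = L L^T
    a = np.linalg.solve(L.T, np.linalg.solve(L, y))  # a = G^{-1} y
    data_fit = -0.5 * (y @ a)
    # log|G| = 2 * sum(log(diag(L))), so -0.5 * log|G| is:
    complexity_penalty = -np.sum(np.log(np.diag(L)))
    constant = -0.5 * n * np.log(2.0 * np.pi)
    return data_fit + complexity_penalty + constant
```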

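To enable gradient-based optimization we need to derive the gradients w.r.t.
the hyperparameters:

$$\frac{\partial}{\partial \theta_j} \log p(\mathbf{y}|X, \boldsymbol{\theta}) =
\frac{1}{2}\mathbf{y}^T G^{-1} \frac{\partial G}{\partial \theta_j} G^{-1} \mathbf{y}
- \frac{1}{2}\mathrm{trace}\left(G^{-1} \frac{\partial G}{\partial \theta_j}\right).$$

¹ See Rasmussen and Williams (2006, pp. 113-114) for details on the derivation
of this formula and also its corresponding gradient calculation.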








The gradients of $G$ depend on the underlying kernel. Therefore we can employ
any kind of valid kernel in this procedure as long as its gradients can be
computed. This not only allows for fine-tuning of hyperparameters but also
allows for kernel extensions which are richly parameterized.
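As a rough sketch of how a kernel plugs into this procedure: given the Gram matrix and its elementwise derivative with respect to one hyperparameter, the gradient of the log marginal likelihood follows directly from the formula above. The function name `lml_gradient` and the use of an explicit inverse are simplifications made for readability.

```python
import numpy as np

def lml_gradient(K, dK, y, noise_var):
    """Gradient of log p(y | X, theta) w.r.t. one hyperparameter theta_j.

    dK is the elementwise derivative of the Gram matrix, dK/d(theta_j).
    For the noise variance itself, pass dK = np.eye(n).
    """
    G = K + noise_var * np.eye(len(y))
    G_inv = np.linalg.inv(G)
    a = G_inv @ y                      # G^{-1} y (G is symmetric)
    return 0.5 * (a @ dK @ a) - 0.5 * np.trace(G_inv @ dK)
```

Any kernel that can return both its value and its derivative for every pair of inputs can therefore be optimized with an off-the-shelf gradient-based optimizer.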


3 Tree Kernels 


The seminal work on Convolution Kernels by Haussler (1999) defines a broad
class of kernels on discrete structures by counting and weighting the number of
substructures they share. Applying Haussler's formulation to trees we reach a
general formula for a tree kernel between two trees $t_1$ and $t_2$, namely

$$k(t_1, t_2) = \sum_{f \in \mathcal{F}} w(f)\, c_1(f)\, c_2(f), \qquad (1)$$

where $\mathcal{F}$ is the set of all tree fragments, $c_1(f)$ and $c_2(f)$
return the counts for fragment $f$ in trees $t_1$ and $t_2$, respectively, and
$w(f)$ assigns a weight to fragment $f$. In other words, we can consider the
kernel a weighted dot product over vectors of fragment counts. The actual
fragment set $\mathcal{F}$ is deliberately left undefined: different concepts
of tree fragments define different tree kernels.
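Equation 1 can be read literally as a weighted dot product over sparse count vectors. The toy example below illustrates this with hand-written fragment counts; the bracketed fragment strings and the constant weight are hypothetical and only serve to show the arithmetic.

```python
from collections import Counter

def fragment_kernel(counts1, counts2, weight):
    """Weighted dot product over fragment counts (Equation 1).

    Fragments missing from either tree have a zero count, so only the
    shared fragments contribute to the sum.
    """
    shared = counts1.keys() & counts2.keys()
    return sum(weight(f) * counts1[f] * counts2[f] for f in shared)

# Hypothetical fragment counts for two small trees.
c1 = Counter({"(S A B)": 1, "(A a)": 2})
c2 = Counter({"(S A B)": 1, "(A a)": 1, "(B b)": 1})
value = fragment_kernel(c1, c2, lambda f: 0.5)
```

In practice the fragment space is far too large to enumerate, which is why the kernels discussed next are computed recursively instead.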

In this paper, we will focus on Subset Tree Kernels (henceforth SSTK), first
introduced by Collins and Duffy (2001). This kernel considers tree fragments
that contain complete grammar rules (see Figure 1 for an example). Consider the
set of nodes in the two trees as $N_1$ and $N_2$ respectively. We define
$I_i(n)$ as an indicator function that returns 1 if fragment
$f_i \in \mathcal{F}$ has root $n$ and 0 otherwise. A SSTK can then be defined
as:

$$k(t_1, t_2) = \sum_{n_1 \in N_1} \sum_{n_2 \in N_2} \Delta(n_1, n_2), \qquad (2)$$

where

$$\Delta(n_1, n_2) = \sum_{i=1}^{|\mathcal{F}|} \lambda^{s(i)} I_i(n_1) I_i(n_2)$$

and $s(i)$ is the number of nodes in fragment $i$ with at least one child.²

Figure 1: An example tree and the respective set of tree fragments defined by a
SSTK.

The formulation in Equation 2 is the same as the one shown in Equation 1 except
that we are now restricting the weights $w(f)$ to be a function of a
hyperparameter $\lambda$. The original goal of $\lambda$ is to act as a decay
factor that penalizes contributions from larger fragments compared to smaller
ones (and therefore, it should be in the [0,1] interval). Without this factor,
the resulting distribution over tree pairs is skewed, giving extremely large
values when trees are equal and rapidly decreasing for small differences over
fragment counts. The decay factor helps to spread this distribution,
effectively giving smaller weights to larger fragments.

The function $\Delta$ can be defined recursively,

$$\Delta(n_1, n_2) = \begin{cases}
0 & pr(n_1) \neq pr(n_2) \\
\lambda & pr(n_1) = pr(n_2) \wedge preterm(n_1) \\
\lambda\, g(n_1, n_2) & \text{otherwise,}
\end{cases}$$

where $pr(n)$ is the grammar production at node $n$ and $preterm(n)$ returns
true if $n$ is a pre-terminal node. The function $g$ is defined as follows:

$$g(n_1, n_2) = \prod_{i=1}^{|n_1|} \left(\alpha + \Delta(c_{n_1}^i, c_{n_2}^i)\right), \qquad (3)$$

where $|n|$ is the number of children of node $n$ and $c_n^i$ is the $i$-th
child of node $n$. This recursive definition is calculated efficiently by
employing dynamic programming to cache intermediate $\Delta$ results.

Equation 3 also adds another hyperparameter, $\alpha$. This hyperparameter was
introduced by Moschitti (2006b)³ as a way to select between two different tree
kernels. If $\alpha = 1$, we get the original SSTK; if $\alpha = 0$, then we
obtain the Subtree Kernel, which only allows fragments with terminal symbols as
leaves. We can also interpret the Subtree Kernel as a “sparse” version of the
SSTK, where the “non-subtree” fragments have their weights equal to zero.

² See Pighin and Moschitti (2010) for details and a proof on this derivation.

³ In his original formulation, this hyperparameter was named $\sigma$ but here
we use $\alpha$ to not confuse it with the GP noise hyperparameter.
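The recursive definition of the SSTK translates almost directly into code. The sketch below is an illustration under stated assumptions rather than the authors' implementation: trees are represented as nested tuples `(label, child, ...)` with words as plain strings that appear only under pre-terminals, and `functools.lru_cache` stands in for the dynamic programming table.

```python
from functools import lru_cache

def label(node):
    return node if isinstance(node, str) else node[0]

def production(node):
    # Grammar production at a node: its label plus its children's labels.
    return (node[0], tuple(label(c) for c in node[1:]))

def is_preterminal(node):
    return all(isinstance(c, str) for c in node[1:])

def internal_nodes(tree):
    if isinstance(tree, str):
        return []
    nodes = [tree]
    for child in tree[1:]:
        nodes.extend(internal_nodes(child))
    return nodes

def sstk(t1, t2, lam, alpha):
    @lru_cache(maxsize=None)          # caches intermediate Delta values
    def delta(n1, n2):
        if production(n1) != production(n2):
            return 0.0
        if is_preterminal(n1):
            return lam
        g = 1.0
        for c1, c2 in zip(n1[1:], n2[1:]):
            g *= alpha + delta(c1, c2)
        return lam * g

    return sum(delta(n1, n2)
               for n1 in internal_nodes(t1)
               for n2 in internal_nodes(t2))

# A small tree S -> A B, A -> a, B -> b.
t = ("S", ("A", "a"), ("B", "b"))
k_sstk = sstk(t, t, lam=1.0, alpha=1.0)     # counts all six fragments
k_subtree = sstk(t, t, lam=1.0, alpha=0.0)  # counts only the three subtrees
```

With $\lambda = 1$ the kernel reduces to a plain fragment count, so switching $\alpha$ from 1 to 0 shows directly which fragments the Subtree Kernel discards.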

Even though fragment weights are affected by both kernel hyperparameters,
previous work did not discuss their effects. The usual procedure fixes $\alpha$
to 1 (selecting the original SSTK) and sets $\lambda$ to a default value
(around 0.4). As explained in §2, the GP model selection procedure enables us
to learn fine-grained values for these hyperparameters, which can lead to
better performing models and aid interpretation. Furthermore, it also allows us
to extend these kernels by adding new hyperparameters. We propose one such
kernel in the next Section.
k(t, t)    6    3    18

Table 1: Resulting fragment weighted counts for the kernel evaluation
$k(t, t)$, for different values of hyperparameters, where $t$ is the tree in
Figure 1.


3.1 Symbol-aware Subset Tree Kernel 


While varying the SSTK hyperparameters can lead to different weight schemes,
they do that in a very coarse way. For some applications, it may be necessary
to give more weight to specific fragments or sets of fragments (e.g., NPs being
more important than ADVPs in an information extraction setting). The
Symbol-aware Subset Tree Kernel (henceforth, SASSTK), which we introduce here,
allows a more fine-grained control over the weights by employing one $\lambda$
and one $\alpha$ hyperparameter for each non-terminal symbol in the training
data. The calculation uses a similar recursive formula to the SSTK, namely:


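$$\Delta(n_1, n_2) = \begin{cases}
0 & pr(n_1) \neq pr(n_2) \\
\lambda_x & pr(n_1) = pr(n_2) \wedge preterm(n_1) \\
\lambda_x\, g_x(n_1, n_2) & \text{otherwise,}
\end{cases}$$

where $x$ is the symbol at node $n_1$ and

$$g_x(n_1, n_2) = \prod_{i=1}^{|n_1|} \left(\alpha_x + \Delta(c_{n_1}^i, c_{n_2}^i)\right). \qquad (4)$$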


The SASSTK can be interpreted as a generalization of the SSTK: we can recover
the latter by tying all $\lambda$ and setting all $\alpha = 1$. By employing
different hyperparameter values for each specific symbol, we can effectively
modify the weights of all fragments where the symbol appears. Table 1 shows an
example where we unrolled a kernel computation into its corresponding feature
space, showing the resulting weighted counts for each feature.
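The only change needed to move from the SSTK to the SASSTK is to look up $\lambda$ and $\alpha$ by the symbol at the current node. The sketch below assumes the same nested-tuple tree representation as before, with the hyperparameters passed as dictionaries keyed by non-terminal symbol; these representation choices are illustrative assumptions.

```python
from functools import lru_cache

def sasstk(t1, t2, lam, alpha):
    """Symbol-aware SSTK; lam and alpha map each non-terminal to its value."""
    def label(n):
        return n if isinstance(n, str) else n[0]

    def production(n):
        return (n[0], tuple(label(c) for c in n[1:]))

    def is_preterminal(n):
        return all(isinstance(c, str) for c in n[1:])

    def internal_nodes(t):
        if isinstance(t, str):
            return []
        return [t] + [m for c in t[1:] for m in internal_nodes(c)]

    @lru_cache(maxsize=None)
    def delta(n1, n2):
        if production(n1) != production(n2):
            return 0.0
        x = n1[0]                       # symbol at node n1
        if is_preterminal(n1):
            return lam[x]
        g = 1.0
        for c1, c2 in zip(n1[1:], n2[1:]):
            g *= alpha[x] + delta(c1, c2)
        return lam[x] * g

    return sum(delta(n1, n2)
               for n1 in internal_nodes(t1)
               for n2 in internal_nodes(t2))

t = ("S", ("A", "a"), ("B", "b"))
ones = {"S": 1.0, "A": 1.0, "B": 1.0}
k_tied = sasstk(t, t, ones, ones)                              # same as the SSTK
k_boosted = sasstk(t, t, {"S": 2.0, "A": 1.0, "B": 1.0}, ones)  # S fragments doubled
```

Raising $\lambda_S$ only rescales the four fragments rooted at S, which is the symbol-specific control over fragment weights described above.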


3.2 Kernel Gradients 

To enable hyperparameter optimization via gradient descent we must provide
gradients for the kernels. In this Section we derive the gradients for SASSTK.

From Equation 2 we know that the kernel is a double summation over the $\Delta$
function. Therefore all gradients are also double summations, but over the
gradients of $\Delta$. We can obtain these in a vectorized way, by considering
the gradients of $\Delta$ with respect to the hyperparameter vectors
$\boldsymbol{\lambda}$ and $\boldsymbol{\alpha}$. Let $k$ be the number of
symbols considered in the model and $\boldsymbol{\lambda}$ and
$\boldsymbol{\alpha}$ be $k$-dimensional vectors containing the respective
hyperparameters.

In the following, we use the notation $\Delta_i$ as a shorthand for
$\Delta(c_{n_1}^i, c_{n_2}^i)$ and we also omit the parameters of $g_x$. We
start with the $\boldsymbol{\lambda}$ gradient:

$$\frac{\partial \Delta(n_1, n_2)}{\partial \boldsymbol{\lambda}} = \begin{cases}
0 & pr(n_1) \neq pr(n_2) \\
\mathbf{u} & pr(n_1) = pr(n_2) \wedge preterm(n_1) \\
\dfrac{\partial(\lambda_x g_x)}{\partial \boldsymbol{\lambda}} & \text{otherwise,}
\end{cases}$$

where $x$ is the symbol at $n_1$, $g_x$ is defined in Equation 4 and
$\mathbf{u}$ is the $k$-dimensional unit vector with the element corresponding
to symbol $x$ equal to 1 and all others equal to 0. The gradient in the third
case is defined recursively,

$$\frac{\partial(\lambda_x g_x)}{\partial \boldsymbol{\lambda}}
= \mathbf{u}\, g_x + \lambda_x \frac{\partial g_x}{\partial \boldsymbol{\lambda}}
= \mathbf{u}\, g_x + \lambda_x \sum_{i=1}^{|n_1|} \frac{g_x}{\alpha_x + \Delta_i}
\frac{\partial \Delta_i}{\partial \boldsymbol{\lambda}}.$$












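The $\boldsymbol{\alpha}$ gradient is derived in a similar way,

$$\frac{\partial \Delta(n_1, n_2)}{\partial \boldsymbol{\alpha}} = \begin{cases}
0 & pr(n_1) \neq pr(n_2) \vee preterm(n_1) \\
\dfrac{\partial(\lambda_x g_x)}{\partial \boldsymbol{\alpha}} & \text{otherwise,}
\end{cases}$$

and the gradient at the second case is also defined recursively,

$$\frac{\partial(\lambda_x g_x)}{\partial \boldsymbol{\alpha}}
= \lambda_x \frac{\partial g_x}{\partial \boldsymbol{\alpha}}
= \lambda_x \sum_{i=1}^{|n_1|} \frac{g_x}{\alpha_x + \Delta_i}
\left(\mathbf{u} + \frac{\partial \Delta_i}{\partial \boldsymbol{\alpha}}\right).$$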

Gradients can be efficiently obtained using dynamic programming. In fact, they
can be calculated at the same time as $\Delta$ to improve performance since
they all share many terms in their derivations. Finally, we can easily obtain
the gradients for the original SSTK by letting $\mathbf{u} = 1$.
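To make the joint computation concrete, here is a sketch for the scalar SSTK case ($\mathbf{u} = 1$) that returns the kernel value together with both gradients in a single pass. It assumes the nested-tuple tree representation used for illustration in this rewrite. One deliberate difference from the formulas above: instead of dividing $g_x$ by $(\alpha_x + \Delta_i)$, the code applies the product rule to a running product, which gives the same derivative but avoids a division by zero when $\alpha + \Delta_i = 0$.

```python
from functools import lru_cache

def sstk_with_grads(t1, t2, lam, alpha):
    """Return (k, dk/dlam, dk/dalpha) for the SSTK with scalar lam and alpha."""
    def label(n):
        return n if isinstance(n, str) else n[0]

    def production(n):
        return (n[0], tuple(label(c) for c in n[1:]))

    def is_preterminal(n):
        return all(isinstance(c, str) for c in n[1:])

    def internal_nodes(t):
        if isinstance(t, str):
            return []
        return [t] + [m for c in t[1:] for m in internal_nodes(c)]

    @lru_cache(maxsize=None)
    def delta(n1, n2):
        # Returns (Delta, dDelta/dlam, dDelta/dalpha).
        if production(n1) != production(n2):
            return 0.0, 0.0, 0.0
        if is_preterminal(n1):
            return lam, 1.0, 0.0
        g, dg_lam, dg_alpha = 1.0, 0.0, 0.0
        for c1, c2 in zip(n1[1:], n2[1:]):
            d, dd_lam, dd_alpha = delta(c1, c2)
            term = alpha + d
            # Product rule on the running product g * term.
            dg_lam = dg_lam * term + g * dd_lam
            dg_alpha = dg_alpha * term + g * (1.0 + dd_alpha)
            g *= term
        return lam * g, g + lam * dg_lam, lam * dg_alpha

    k = dk_lam = dk_alpha = 0.0
    for n1 in internal_nodes(t1):
        for n2 in internal_nodes(t2):
            d, dl, da = delta(n1, n2)
            k += d
            dk_lam += dl
            dk_alpha += da
    return k, dk_lam, dk_alpha
```

For the tree S → A B, A → a, B → b compared with itself, the kernel is $\lambda(\alpha + \lambda)^2 + 2\lambda$, which gives a simple closed form to check the returned gradients against.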

3.3 Kernel Normalization 

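It is common practice when using tree kernels to normalize the kernel. This
helps reduce the random effect of tree size. Normalization can be achieved
using the following, where $\hat{k}$ is the normalized kernel:

$$\hat{k}(t_1, t_2) = \frac{k(t_1, t_2)}{\sqrt{k(t_1, t_1)\, k(t_2, t_2)}}.$$

To apply this normalized version in the optimization procedure we must also
derive gradients for the normalization function. In the following equation, we
use $k_{ij}$ and $\hat{k}_{ij}$ as a shorthand for $k(t_i, t_j)$ and
$\hat{k}(t_i, t_j)$, respectively:

$$\frac{\partial \hat{k}_{12}}{\partial \theta} =
\frac{\dfrac{\partial k_{12}}{\partial \theta}}{\sqrt{k_{11} k_{22}}}
- \hat{k}_{12}\,
\frac{\dfrac{\partial k_{11}}{\partial \theta} k_{22} + k_{11} \dfrac{\partial k_{22}}{\partial \theta}}{2\, k_{11} k_{22}}.$$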
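The normalization and its gradient only need the three unnormalized kernel values and their derivatives with respect to the hyperparameter of interest. A small sketch, with an illustrative function name:

```python
import math

def normalized_kernel_and_grad(k12, k11, k22, dk12, dk11, dk22):
    """Normalized kernel value and its derivative w.r.t. one hyperparameter.

    k12, k11, k22    : k(t1, t2), k(t1, t1), k(t2, t2)
    dk12, dk11, dk22 : their derivatives w.r.t. the same hyperparameter
    """
    norm = math.sqrt(k11 * k22)
    k_hat = k12 / norm
    dk_hat = dk12 / norm - k_hat * (dk11 * k22 + k11 * dk22) / (2.0 * k11 * k22)
    return k_hat, dk_hat
```

Since the self-kernels $k_{11}$ and $k_{22}$ also depend on the hyperparameters, their gradients have to be computed alongside $k_{12}$; a tree compared with itself always normalizes to 1 with zero gradient.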


3.4 Other Extensions 


Many other structural kernels rely on recursive definitions and dynamic
programming to perform their calculations. Examples include other tree kernels
like the Partial Tree Kernel (Moschitti, 2006a) and string kernels like the
ones defined on character n-grams (Lodhi et al., 2002) or word sequences
(Cancedda et al., 2003). While in this paper we focus on the SSTK (and our
proposed SASSTK), our approach can easily be extended to these other kernels,
as long as all the corresponding recursive definitions are differentiable.

4 Synthetic Data Experiments 

A natural question that arises in the proposed method is how much data is
needed to accurately learn the kernel hyperparameters. To answer this question,
we run a set of experiments using synthetic data. We generate this data by
using a set of 1000 natural language syntactic trees, where we fix a random
subset of 200 instances for testing and use the remaining 800 instances as
training. For each training set size we define a GP over the full dataset,
sample a function from it and use the function output as the response variable
for each tree. We try two different GP priors, one using the SSTK and another
one using the SASSTK.

The conditions above provide a controlled environment to check the modelling
capacities of our approach since we know the exact distribution where the data
comes from. The reasoning behind these experiments is that to be able to
provide benefits in real tasks, where the data distribution is not known, our
models have to be learnable in this controlled setting as well using a
reasonable amount of data.

Finally, we also provide an empirical evaluation comparing the speed
performance between our approach and grid search.

4.1 SSTK Prior 

Our first experiments use a SSTK as the kernel with $\lambda = 0.001$,
$\alpha = 1$ and $\sigma_n^2 = 0.01$. After obtaining the input trees and their
sampled labels, we define a new GP model using only the training data plus the
obtained response variables, this time using a SSTK with randomized
hyperparameter values. Then we optimize the GP and check if the learned
hyperparameters are close to the original ones, using 10 random restarts to
limit the effect of local optima. We also use the optimized GP to predict
response variables on the test set and measure Root Mean Squared Error (RMSE).
Our hypothesis is that with a reasonable sample size we can retrieve the
original hyperparameter values and obtain low RMSE. For each training set size,
we repeat the experiment 20 times.
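Sampling synthetic response variables from a GP prior amounts to drawing from a multivariate Gaussian whose covariance is the Gram matrix plus noise. The sketch below shows one way this could be done; it is an assumption about the mechanics rather than a description of the exact code used in the experiments, and `sample_gp_responses` is a hypothetical name.

```python
import numpy as np

def sample_gp_responses(K, noise_var, rng):
    """Draw noisy responses y ~ N(0, K + noise_var * I) for all inputs at once.

    K is the Gram matrix over every tree in the dataset (train and test),
    computed with the known, fixed hyperparameters.
    """
    n = K.shape[0]
    G = K + noise_var * np.eye(n)
    L = np.linalg.cholesky(G)            # G = L L^T
    return L @ rng.standard_normal(n)    # has covariance L L^T = G
```

A call such as `sample_gp_responses(K, 0.01, np.random.default_rng(0))` would give one synthetic label per tree, which can then be split into training and test portions.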




















Figure 2 shows the results of these experiments. For small sizes the variance
in the resulting hyperparameter values is large but as soon as we reach 200
instances we are able to retrieve the original values with high confidence. In
other words, in an ideal setting 200 instances are enough to learn the kernel.
It is also interesting to note that test RMSE after optimization steadily
decreases as we increase training data size. This shows that if one is more
interested in predictions themselves, it is still worth optimizing
hyperparameters even if the training data is small.



Figure 2: Results of synthetic experiments optimizing SSTK. The x axes
correspond to different training set sizes and the y axes are the obtained
hyperparameter values in the first three plots and RMSE in the last plot.
Dashed lines show the original hyperparameter values. Points are offset in the
RMSE chart for legibility.


4.2 SASSTK Prior 

The large number of hyperparameters of the SASSTK makes it more prone to
optimization and overfitting issues when compared to the SSTK. This raises the
question of how much data is needed to justify its use. To address this
question, we run similar experiments to those above for the SSTK, except that
now we sample from a GP using a SASSTK as the kernel.

Instead of optimizing all hyperparameters freely we use a simpler version where
we tie $\lambda$ and $\alpha$ for each symbol to the same value, except for the
symbol 'S'. Effectively this version has one extra $\lambda$ and one extra
$\alpha$ (henceforth $\lambda_S$ and $\alpha_S$) when compared to the SSTK. The
GP prior hyperparameter values are set to $\lambda = 0.001$, $\lambda_S = 0.5$,
$\alpha = 0.1$, $\alpha_S = 1$ and $\sigma_n^2 = 0.01$. For each training set
size, we train two GPs, one using this SASSTK and one using the original SSTK,
optimize them using 10 random restarts and measure RMSE on the test set.

Results are shown in Figure 3. For all training set sizes the SASSTK reaches
lower RMSE than SSTK, with a substantial difference after reaching 100
instances. This shows that even for small datasets our proposed kernel manages
to capture aspects which cannot be explained by the original SSTK. Note that
this is an ideal setting, and real datasets may need to be larger to realize
gains from SASSTK. Nevertheless, these are promising results since they give
evidence of a small lower bound on the dataset size for SASSTK to be effective.

Figure 3: Results from synthetic experiments comparing SSTK and SASSTK. The x
axis is training set size while the y axis corresponds to RMSE.






































4.3 Performance Experiments 

To provide an overview of how efficient the gradient-based method is compared
to grid search we also run a set of experiments measuring wall clock training
time vs. RMSE on a test set. For both GP and SVM models we employ the SSTK as
the kernel and we use the same synthetic data from the previous experiments.⁴
We perform 20 runs, keeping the test set as the same 200 instances for all runs
and randomly sampling 200 instances from the remaining instances as training
data.

Figure 4 shows the curves for both GP and SVM models. The GP curve is obtained
by increasing the maximum number of iterations of the gradient-based method (in
this case, L-BFGS) and the SVM curve is obtained by increasing the granularity
of the grid size.



Figure 4: Results from performance experiments. The x axis corresponds to wall
clock time in seconds and it is in log scale. The y axis shows RMSE on the test
set. The blue dashed line corresponds to the RMSE value obtained after L-BFGS
converged. Error bars are obtained by measuring one standard deviation over the
20 runs made in each experiment.

We can see that optimizing the GP model is consistently much faster than doing
grid search on the SVM model (notice the logarithmic scale), even though it
shows some variance when letting L-BFGS run for a larger number of iterations.
The GP model is also able to make better predictions in general. Even when
taking the variances into account, grid search would still need around 10 times
more computation time to achieve the same predictions obtained by the GP model.
In real settings, SVM predictions tend to be more on par with the ones provided
by a GP (as shown in §5) but nevertheless these figures show that the GP can be
much more time efficient when optimizing hyperparameters of a tree kernel.

⁴ For specific details on the SVM models used in all experiments performed in
this paper we refer the reader to the Appendix.

An important performance aspect to take into account is parallelization. Grid
search is embarrassingly parallelizable since each grid point can run in a
different core. However, the GP optimization can also benefit from multiple
cores by running each kernel computation inside the Gram matrix in parallel. To
keep the comparisons simpler, the results shown in this section use a single
core but all experiments in §5 employ parallelization in the Gram matrix
computation level (for both SVM and GP models).


5 NLP Experiments 


Our experiments with NLP data address two regression tasks: Emotion Analysis
and Quality Estimation. For both tasks, we use the Stanford parser (Manning et
al., 2014) to obtain constituency trees for all sentences. Also, rather than
using the official data splits, we perform 5-fold cross-validation in order to
obtain more reliable results.


5.1 Emotion Analysis 

The goal of Emotion Analysis is to automatically detect emotions in a text
(Strapparava and Mihalcea, 2008). This problem is closely related to Opinion
Mining (Pang and Lee, 2008), with similar applications, but it is usually done
at a more fine-grained level and involves the prediction of a set of labels for
each text (one for each emotion) instead of a single label.

Beck et al. (2014a) used a multi-task GP for this task with a bag-of-words
feature representation. In theory, it is possible to combine their multi-task
kernel with our tree kernels, but to keep the focus of the experiments on
testing tree kernel approaches, here we use independently trained models, one
per emotion.

Dataset We use the dataset provided by the “Affective Text” shared task in
SemEval2007 (Strapparava and Mihalcea, 2007), which is composed of 1000 news
headlines annotated in terms of six emotions: Anger, Disgust, Fear, Joy,
Sadness and Surprise. For each emotion, a score between 0 and 100 is given, 0
meaning total lack of emotion and 100, maximally emotional. Scores are
mean-normalized before training the models.


Models We perform experiments using the following tree kernels:

• SSTK: the SSTK formulation introduced by Moschitti (2006b);

• SASSTK_full: our proposed Symbol-Aware SSTK;

• SASSTK_S: same as before, but using only two $\lambda$ and two $\alpha$
  hyperparameters: one for symbols corresponding to full sentences⁵ and another
  for all other symbols. This configuration is similar to that in Section 4.2.

For all kernels, we also use a variation fixing the $\alpha$ hyperparameters to
1 to emulate the original SSTK.

Baselines and evaluation Our results are compared against three baselines:

• SVM SSTK: a SVM using an SSTK kernel.

• SVM BOW: same as before, but using an RBF kernel with a bag-of-words
  representation.

• GP BOW: same as SVM BOW but using a GP instead.

The SVM models are trained using a wrapper for LIBSVM⁶ (Chang and Lin, 2001)
provided by the scikit-learn toolkit⁷ (Pedregosa et al., 2011) and optimized
via grid search. Following previous work, we use Pearson's correlation
coefficient as evaluation metric. Pearson's scores are obtained by
concatenating the outputs for all six emotions together.

Table 2 shows the results. The best GP model with tree kernels outperforms the
SVMs, showing that the fine-grained model selection procedure provided by the
GP models is helpful when dealing with tree kernels. However, using the SASSTK
models does not help in the case of free $\alpha$ and the SASSTK_full actually
performs worse than the original SSTK, even though the optimized marginal
likelihood was higher. This is evidence that the SASSTK_full model is
overfitting the training data, probably due to its large number of
hyperparameters.

⁵ In this dataset, symbols are S, SQ, SBARQ and SINV.
⁶ www.csie.ntu.edu.tw/~cjlin/libsvm
⁷ http://scikit-learn.org



                     Pearson's
SVM BOW              0.5690
SVM SSTK             0.5254
GP BOW               0.5891
(free α)
GP SSTK              0.5713
GP SASSTK_full       0.5118
GP SASSTK_S          0.5710
(fixed α = 1)
GP SSTK              0.5093
GP SASSTK_full       0.5435
GP SASSTK_S          0.5225

Table 2: Pearson's correlation scores for the Emotion Analysis task (higher is
better).
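The scores in Table 2 are Pearson's correlations computed over the concatenation of the per-emotion outputs. A plain-Python sketch of that evaluation step follows; the dictionary layout (emotion name mapped to a list of scores) is an assumption made for the example.

```python
import math

def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

def concatenated_pearson(gold_by_emotion, pred_by_emotion):
    """Concatenate all emotions' outputs, then compute a single correlation."""
    emotions = sorted(gold_by_emotion)   # same order for gold and predictions
    gold = [v for e in emotions for v in gold_by_emotion[e]]
    pred = [v for e in emotions for v in pred_by_emotion[e]]
    return pearson(gold, pred)
```

Because each emotion is modelled independently, concatenating before correlating yields one score per model rather than six.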

Another interesting finding in Table 2 is that fixing the $\alpha$ values often
harms performance. Inspecting the free $\alpha$ models showed that the values
found by the optimizer were very close to zero. This indicates that the model
selection procedure prefers giving smaller weights to incomplete tree
fragments. We can interpret this as the model selecting a more lexicalized
feature space, which also explains why the GP RBF model on bag-of-words
performed the best in this task.

Finally, to understand how the optimized kernels could provide more
interpretability, Table 3 shows the top 15 $\lambda$ values obtained by the
SASSTK_full (fixed $\alpha$ variant) with their corresponding symbols. In this
specific case the kernel does not give the best performance so there are
limitations in doing a full linguistic analysis. Nevertheless, we believe this
example shows the potential for developing more interpretable kernels. This is
especially interesting because these models take into account a much richer
feature space than what is allowed by parametric models.


JJR     0.8333    WHADVP  0.5004    VBP   0.4653
PRP$    0.6933    QP      0.5001    WHNP  0.4508
WDT     0.6578    JJS     0.4996    NN    0.4274
RBR     0.5445    NNS     0.4961    JJ    0.4021
VBG     0.5163            0.4777    SQ    0.4000

Table 3: Top 15 symbols sorted according to their obtained $\lambda$ values in
the SASSTK_full model with fixed $\alpha$. The numbers are the corresponding
$\lambda$ values, averaged over all six emotions.


5.2 Quality Estimation

The goal of Quality Estimation is to provide a quality prediction for new,
unseen machine translated texts (Blatz et al., 2004; Bojar et al., 2014).
Examples of applications include filtering machine translated sentences that
would require more post-editing effort than translation from scratch (Specia et
al., 2009), selecting the best translation from different MT systems (Specia et
al., 2010) or between an MT system and a translation memory (He et al., 2010),
and highlighting segments that need revision (Bach et al., 2011). While various
quality metrics exist, here we focus on post-editing time prediction.

Tree kernels have been used before in this task (with SVMs) by Hardmeier (2011)
and Hardmeier et al. (2012). While their best models combine tree kernels with
a set of explicit features, they also show good results using only the tree
kernels. This makes Quality Estimation a good benchmark task to test our
models.


Datasets We use two publicly available datasets containing post-edited machine
translated sentences. Both are composed of a set of source sentences, their
machine translated outputs and the corresponding post-editing time.

• French-English (fr-en): This dataset, described in (Specia, 2011), contains
  2524 French sentences translated into English and post-edited by a novice
  translator.

• English-Spanish (en-es): This dataset was used in the WMT14 Quality
  Estimation shared task (Bojar et al., 2014), containing 858 sentences
  translated from English into Spanish and post-edited by an expert translator.

For each dataset, post-editing times are first divided by the translation
output length (obtaining the post-editing time per word) and then mean
normalized.

Models Since our data consists of pairs of trees, our models in this task use a
pair of tree kernels. We combine these two kernels by either summing or
multiplying them. As for underlying tree kernels, we try both SSTK and
SASSTK_S. As in the Emotion Analysis task, we also experiment with a set of
kernel configurations with the $\alpha$ hyperparameters fixed at 1. We also
test models that combine our tree kernels with an RBF kernel on a set of 17
features extracted using the QuEst framework (Specia et al., 2013). These
features are part of a strong baseline model used by the WMT14 shared task.

Baselines and evaluation We compare our results with a number of SVM models:

• SVM SSTK: same as in the Emotion Analysis task, using either a sum (+) or a
  product (×) of SSTKs.

• SVM RBF: this is an SVM trained on the 17 features extracted by QuEst.

• SVM RBF SSTK: a combination of the two models above.

For further comparison, we also show results obtained using a GP model and an
RBF kernel on the QuEst-only features. Following previous work, we measure
prediction performance using both Mean Absolute Error (MAE) and RMSE.

The prediction results are given in Table 4. They indicate a number of
interesting findings:

• For the fr-en dataset, the GP models combining tree kernels with an RBF
  kernel outperform all other models. Results for the en-es dataset are less
  consistent, probably due to the small size of the dataset, but on average
  they are better than their SVM counterparts.

• The SVMs using a combination of kernels perform worse than using the RBF
  kernel alone. Inspecting the models, we found that grid search actually harms
  performance. For instance, for the fr-en dataset, MAE and RMSE for the
  RBF + SSTK × model before grid search are 0.4681 and 0.6016, respectively. On
  the other hand, for this dataset all GP models achieve better results after
  optimization.


























• Unlike in the Emotion Analysis task, fixing $\alpha$ results in better
  performance, even though the resulting models have lower marginal likelihood
  than the ones with free $\alpha$. The same effect happened when comparing the
  SASSTK models with the SSTK ones for the en-es dataset. Both cases are
  evidence of model overfitting.



                        French-English      English-Spanish 
                        MAE      RMSE       MAE      RMSE 

SVM 
  RBF                   0.4610   0.5944     0.7831   1.0238 
  SSTK +                0.4710   0.6006     0.7777   1.0820 
  SSTK ×                0.4681   0.6016     0.7884   1.1044 
  RBF SSTK +            0.5146   0.6267     0.8077   1.0295 
  RBF SSTK ×            0.5186   0.6299     0.8367   1.0427 

GP RBF                  0.4555   0.5830     0.7842   1.0735 

GP (free α) 
  SSTK +                0.4789   0.5912     0.7551   1.0281 
  SSTK ×                0.4804   0.5843     0.7440   1.0008 
  SASSTKs +             0.4756   0.5889     0.8096   1.0754 
  SASSTKs ×             0.4797   0.5868     0.7484   1.0102 

GP (fixed α = 1) 
  SSTK +                0.4694   0.5808     0.7614   1.0019 
  SSTK ×                0.4708   0.5733     0.7205   0.9870 
  SASSTKs +             0.4758   0.5888     0.8242   1.0912 
  SASSTKs ×             0.4699   0.5751     0.7469   1.0280 

GP (fixed α = 1) 
  RBF SSTK +            0.4408   0.5651     0.7591   1.0469 
  RBF SSTK ×            0.4443   0.5659     0.7389   1.0302 
  RBF SASSTKs +         0.4406   0.5648     0.7692   1.0682 
  RBF SASSTKs ×         0.4440   0.5658     0.7682   1.0628 


Table 4: Error scores for the Quality Estimation task 
(lower is better). Results are in terms of post-editing time 
per word. Bold scores are the best ones for each dataset. 


We also inspect the resulting hyperparameters to 
obtain insights about the features used by the model. 
Table 5 shows the optimized λ values for the GP 
SSTK models with fixed α for the fr-en dataset. The 
λ values obtained are higher for the target sentence 
kernels than for the source sentence ones. We can 
interpret this as the model giving preference to features 
from the target trees instead of the source trees, 
which is what we would expect for this task. 



                 λ_src     λ_tgt 
GP SSTK +        0.1394    0.3108 
GP SSTK ×        0.1405    0.2641 


Table 5: Learned hyperparameters for the GP SSTK models 
in the fr-en dataset, with α fixed at 1. λ_src and λ_tgt 
are the hyperparameters corresponding to the kernels on 
the source and target parse trees, respectively. The values 
shown are averaged over the cross-validation results. 


5.3 Overfitting 

Both NLP tasks show evidence that the GP models 
with a large number of hyperparameters (SASSTK_full 
in the case of Emotion Analysis and the free α models 
in Quality Estimation) are overfitting the corresponding 
datasets. While the Bayesian formulation 
for the marginal likelihood does help limit 
overfitting, it does not prevent it completely. Small 
datasets or invalid assumptions about the Gaussian 
distribution of the data may still lead to poorly fitting 
models. Another means of reducing overfitting is 
taking a fully Bayesian approach in which hyperparameters 
are considered as random variables and are 
marginalized out (Osborne, 2010); this is a research 
direction we plan to pursue in the future.* 


5.4 Extensions to Other Tasks 

The GP framework introduced in Section 2 can 
be extended to non-regression problems by changing 
the likelihood function. For instance, models 
for classification (Rasmussen and Williams, 2006, 
Chap. 3), ordinal regression (Chu and Ghahramani, 
2005) and structured prediction (Altun et al., 2004; 
Bratières et al., 2013) have been proposed in the 
literature. Since the likelihood is independent of the 
kernel, a natural future step is to apply the kernels 
and models introduced in this paper to different NLP 
tasks. 

In light of that, we did initial experiments in 
constituency parsing reranking.† The first results were 
inconclusive, but we believe this is because we 
employed naive approaches using classification 
(1-best result vs. all) and regression (using PARSEVAL 
metrics as the response variable) models. A more 
appropriate way to tackle this task is by employing a 
reranking-based likelihood, and this is a direction we 
plan to pursue in the future. 

* See also Rasmussen and Williams (2006, Chap. 5) for an 
in-depth discussion on this issue. 

† We thank the anonymous reviewers for this suggestion. 


6 Related Work 

Interest in model selection procedures for kernel-based 
methods has been growing in recent years. 

































One widely used approach for that is Multiple Kernel 
Learning (MKL) (Gönen and Alpaydın, 2011). 
MKL is based on the idea of using combinations 
of kernels to model the data and developing algorithms 
to tune the kernel coefficients. This is different 
from our method, where we focus on learning 
the hyperparameters of a single structural kernel. An 
approach similar to ours was proposed by Igel et al. 
(2007). They combine oligo kernels (a kind of n-gram 
kernel) with MKL, derive their gradients and 
optimize towards a kernel alignment metric. Compared 
to our approach, they restrict the length of 
the n-grams being considered, while we rely on dynamic 
programming to explore the whole substructure 
space. Also, their method does not take into 
account the underlying learning algorithm. Another 
recent approach proposed for model selection is random 
search (Bergstra and Bengio, 2012). Like grid 
search, it has the drawback of not employing gradient 
information, as it is designed for any kind of 
hyperparameters (including categorical ones). 
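
The contrast between the two gradient-free strategies can be sketched as below. This is an illustration under stated assumptions: the search space (a `decay` value and a categorical `combine` choice), the helper names and `toy_loss` are all hypothetical and do not come from the experiments. Both strategies only evaluate the loss at candidate points, which is why they also handle the categorical choice.

```python
import itertools
import random


def grid_candidates(grid):
    """Grid search: enumerate every combination of the listed values."""
    names = sorted(grid)
    return [dict(zip(names, values))
            for values in itertools.product(*(grid[name] for name in names))]


def random_candidates(space, n_trials, seed=0):
    """Random search: draw independent samples from the search space.

    A tuple (low, high) is sampled uniformly; a list is treated as categorical.
    """
    rng = random.Random(seed)
    candidates = []
    for _ in range(n_trials):
        candidate = {}
        for name in sorted(space):
            spec = space[name]
            if isinstance(spec, tuple):
                candidate[name] = rng.uniform(*spec)
            else:
                candidate[name] = rng.choice(spec)
        candidates.append(candidate)
    return candidates


def toy_loss(c):
    """Made-up validation loss, for illustration only."""
    penalty = 0.0 if c["combine"] == "product" else 0.1
    return (c["decay"] - 0.3) ** 2 + penalty


grid = grid_candidates({"decay": [0.0, 0.5, 1.0],
                        "combine": ["sum", "product"]})
rand = random_candidates({"decay": (0.0, 1.0),
                          "combine": ["sum", "product"]}, n_trials=6)

best_grid = min(grid, key=toy_loss)  # limited to the three listed decay values
best_rand = min(rand, key=toy_loss)  # may land anywhere inside the interval
```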

Structural kernels have been successfully employed 
in a number of NLP tasks. The original 
SSTK proposed by Collins and Duffy (2001) was 
used to rerank the output of syntactic parsers. Recently, 
this reranking idea was also applied to discourse 
parsing (Joty and Moschitti, 2014). Other 
tree kernel applications include Semantic Role Labelling 
(Moschitti et al., 2008) and Relation Extraction 
(Plank and Moschitti, 2013). String kernels 
were mostly used in Text Classification (Lodhi et 
al., 2002; Cancedda et al., 2003), while graph kernels 
have been used for recognizing Textual Entailment 
(Zanzotto and Dell’Arciprete, 2009). However, 
these previous works focused on frequentist 
methods like SVM or voted perceptron while we 
employ a Bayesian approach. 

Gaussian Processes are a major framework 
in machine learning nowadays: applications include 
Robotics (Ko et al., 2007), Geolocation 
(Schwaighofer et al., 2004) and Computer Vision 
(Sinz et al., 2004). Only very recently have they 
been successfully employed in a few NLP tasks such 
as translation quality estimation (Cohn and Specia, 
2013; Beck et al., 2014b), detection of temporal patterns 
in text (Preoţiuc-Pietro and Cohn, 2013), semantic 
similarity (Rios and Specia, 2014) and emotion 
analysis (Beck et al., 2014a). In terms of feature 
representations, previous work focused on vectorial 
inputs and applied well-known kernels for these 
inputs, e.g. the RBF kernel. As shown in §5.2, our 
approach is orthogonal to these previous ones, since 
kernels can be easily combined in different ways. 

It is important to note that we are not the first 
to combine GPs with kernels on structured inputs. 
Driessens et al. (2006) employed a combination of 
GPs and graph kernels for reinforcement learning. 
However, unlike our approach, they did not attempt 
model selection, evaluating only a few hyperparameter 
values empirically. 


7 Conclusions 


This paper describes a Bayesian approach for structural 
kernel learning, based on Gaussian Processes 
for easy model selection. Experiments applying our 
models to synthetic data showed that it is possible 
to learn structural kernel hyperparameters using a 
fairly small amount of data. Furthermore, we obtained 
promising results in two NLP tasks, including 
Quality Estimation, where we beat the state of 
the art. Finally, we showed how these rich 
parameterizations can lead to more interpretable kernels. 

Beyond empirical improvements, an important 
goal of this paper is to present a method that enables 
new kernel developments through the extension 
of the number of hyperparameters. We focused 
on the Subset Tree Kernel, proposing an extension 
and then deriving its gradients. This approach can 
be applied to any structural kernel, as long as gradients 
are available. It is our hope that this work will 
serve as a starting point for future developments in 
these research directions. 
A Details on SVM Baselines 

All SVM baselines employ the ε-insensitive loss 
function. Grid search optimization is done via 3-fold 
cross-validation on the respective training set 
and uses RMSE as the metric to be minimized. After 
obtaining the best hyperparameter values, the SVM 
is retrained using these values on the full respective 
training set. The specific intervals used in grid 
search depend on the task. 
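
The procedure above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the baselines: a 1-D ridge regression (`fit_ridge`, `predict_ridge`, with a hypothetical `penalty` hyperparameter) stands in for the SVM so that the example needs no external library, and the folds are taken as contiguous blocks, which the text does not specify.

```python
import itertools
import math


def rmse(y_true, y_pred):
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


def k_fold_indices(n, k=3):
    """Split range(n) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def grid_search_cv(fit, predict, X, y, grid, k=3):
    """Pick the grid point with the lowest mean RMSE over k folds,
    then retrain on the full training set with those values."""
    names = sorted(grid)
    best_params, best_score = None, float("inf")
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = []
        for held_out in k_fold_indices(len(X), k):
            held = set(held_out)
            train = [i for i in range(len(X)) if i not in held]
            model = fit([X[i] for i in train], [y[i] for i in train], **params)
            preds = predict(model, [X[i] for i in held_out])
            scores.append(rmse([y[i] for i in held_out], preds))
        score = sum(scores) / k
        if score < best_score:
            best_params, best_score = params, score
    return best_params, fit(X, y, **best_params)


# Stand-in model (an assumption): 1-D ridge regression through the origin.
def fit_ridge(X, y, penalty):
    return sum(a * b for a, b in zip(X, y)) / (sum(a * a for a in X) + penalty)


def predict_ridge(w, X):
    return [w * a for a in X]


X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]  # noiseless toy data, y = 2x
best, model = grid_search_cv(fit_ridge, predict_ridge, X, y,
                             {"penalty": [0.0, 1.0, 10.0]})
```

On this noiseless toy data the unpenalized fit is exact in every fold, so the search selects `penalty = 0.0` and the retrained slope is 2.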

For the performance experiments on synthetic 
data, we employed an interval of [10“^, 10] for C 
(regularization coefficient) and ε, [10“^, 1] for λ and 
[10“^, 2] for α. In each run we incrementally increase 
the size of the grid by adding intermediate 
values on each interval. We keep a linear scale for 
the SSTK hyperparameters and a logarithmic scale 
for C and ε. As an example, Table 6 shows the resulting 
grids when the grid size is 4 for each hyperparameter. 
For all NLP experiments the grid is fixed 
for all hyperparameters (including γ, the lengthscale 
value in the RBF kernel), with its corresponding values 
shown in Table 7. 


C/ε 
λ      [10-^, 0.33, 0.67, 1] 
α      [10-'^, 0.67, 1.33, 2] 


Table 6: Resulting grids for the performance experiments 
when grid size is set to 4 for each hyperparameter. 






C      [10-^, 1, 100] 
ε      [10-^, 10-^, 1, 10] 
λ      [10-l^, 0.25, 0.5] 
α      [1] (fixed) 
γ      [10-^, 0.0178, 0.316, 5.62, 100] 


Table 7: Grid values for the NLP experiments. 
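
The two grid scales can be sketched as below. The helper names `linear_grid` and `log_grid` are illustrative, and the lower bounds used here are assumptions: `LOW` is a placeholder small positive value, and `1e-3` for the RBF lengthscale was chosen because a log-spaced grid from that value to 100 is consistent with the intermediate values 0.0178, 0.316 and 5.62 listed in Table 7.

```python
import math


def linear_grid(low, high, size):
    """`size` evenly spaced values from low to high, inclusive."""
    step = (high - low) / (size - 1)
    return [low + i * step for i in range(size)]


def log_grid(low, high, size):
    """`size` values evenly spaced in log10 from low to high, inclusive."""
    lo, hi = math.log10(low), math.log10(high)
    step = (hi - lo) / (size - 1)
    return [10 ** (lo + i * step) for i in range(size)]


LOW = 1e-8  # placeholder lower bound (an assumption for illustration)

# Linear scale for the SSTK hyperparameters, grid size 4 as in Table 6.
decay_grid = linear_grid(LOW, 1.0, 4)  # rounds to 0.00, 0.33, 0.67, 1.00
alpha_grid = linear_grid(LOW, 2.0, 4)  # rounds to 0.00, 0.67, 1.33, 2.00

# Logarithmic scale with five points; the lower end 1e-3 is an assumption.
lengthscale_grid = log_grid(1e-3, 100.0, 5)
```

A logarithmic scale spends grid points evenly across orders of magnitude, which suits hyperparameters such as C whose useful range spans several decades, while a linear scale suits hyperparameters confined to a short interval.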




